Running Head: Assessing item-level fit for higher-order IRT models 


Assessing Item-Level Fit for Higher Order Item Response Theory Models 


Xue Zhang 
Northeast Normal University 
Chun Wang 
University of Washington 
Jian Tao 
Northeast Normal University 


The project is supported by the National Natural Science and Social Science Foundations of China 
(Grants 11571069) and Institute of Education Sciences grant R305D170042 (originally 
R305D160010). 


Citation: Zhang, X., Wang, C., & Tao, J. (2018). Assessing Item-level fit for higher order item 
response theory models. Applied Psychological Measurement, 42, 644-659. 


Article
Applied Psychological Measurement
© The Author(s) 2018
DOI: 10.1177/0146621618762740
journals.sagepub.com/home/apm


Xue Zhang¹, Chun Wang², and Jian Tao¹


Abstract 


Testing item-level fit is important in scale development to guide item revision/deletion. Many
item-level fit indices have been proposed in the literature, yet none of them were directly applicable
to an important family of models, namely, the higher order item response theory (HO-IRT)
models. In this study, chi-square-based fit indices (i.e., Yen's Q₁, McKinley and Mills's G², and
Orlando and Thissen's S-X² and S-G²) were extended to HO-IRT models. Their performances are
evaluated via simulation studies in terms of false positive rates and correct detection rates. The
manipulated factors include test structure (i.e., test length and number of dimensions), sample
size, level of correlations among dimensions, and the proportion of misfitting items. For
misfitting items, the sources of misfit, including misfitting item response functions and
misspecified factor structures, were also manipulated. The results from simulation studies
demonstrate that S-G² is promising for higher order items.


Keywords 


higher order IRT models, item fit, S-X², S-G², false positive rate, correct detection rate


Introduction 


Item response theory (IRT) models have gained widespread use since their introduction. 
Originally, it was assumed that the latent trait is unidimensional (Baker & Kim, 2004). More
recently, multidimensional IRT models have been proposed to relax this assumption (Reckase,
2009). Nonetheless, neither unidimensional IRT models nor multidimensional IRT models
capture the hierarchical nature of latent traits, in which the multiple, domain-level latent traits are
related to a higher order general trait. de la Torre and Song (2009) proposed a higher order item 
response theory (HO-IRT) model that captures the overall and domain-specific abilities. By 
positing a higher order structure, the HO-IRT model has been shown to measure domain- 
specific abilities and estimate item parameters better than the typical multidimensional IRT 
models. In recent years, the HO-IRT model has been used in a wide variety of domains. For 
instance, it is used to evaluate examinees’ abilities in computerized adaptive testing (Huang, 


¹Northeast Normal University, Changchun, Jilin, China
²University of Minnesota, Minneapolis, MN, USA


Corresponding Author: 

Jian Tao, Key Laboratory for Applied Statistics, School of Mathematics and Statistics, Northeast Normal University, 5268 
Renmin Street, Changchun, Jilin, 130024, China. 

Email: taoj@nenu.edu.cn 


Zhang et al. 645 


Chen, & Wang, 2012; Lee, 2014; Wang, 2014); to measure testlet-based items (Huang & 
Wang, 2013); to evaluate hierarchical latent traits (Huang, Wang, Chen, & Su, 2013); to ana- 
lyze sparse, multigroup data for integrative data analysis (Huo et al., 2015); to assess longitudi- 
nal data (Huang, 2015); to measure academic growth within both IRT and structural equation 
modeling (SEM) frameworks (Wang, Kohli, & Henn, 2016); and to integrate with mixture IRT 
models to account for subclasses within a population (Huang, 2017). 

When a parametric model is fitted to data, item-level fit is usually assessed to guide item
revision/deletion. To evaluate item fit, numerous statistical procedures have been introduced in
the IRT literature (Bock, 1972; Bock & Haberman, 2009; Chon, Lee, & Dunbar, 2010; Demars,
2005; Glas & Suarez-Falcon, 2003; Haberman, 2009; Haberman, Sinharay, & Chon, 2013;
Kang & Chen, 2008; LaHuis, Clark, & O'Brien, 2011; Li & Rupp, 2011; Liang & Wells, 2009;
McKinley & Mills, 1985; Muraki & Bock, 1997; Orlando & Thissen, 2000, 2003; Ranger &
Kuhn, 2012; Roberts, 2008; Sinharay, 2005, 2006; Stone, 2000; Stone & Zhang, 2003;
Suarez-Falcon & Glas, 2003; Wang, Shu, Shang, & Xu, 2015; Wells & Bolt, 2008; Yen, 1981; Zhang
& Stone, 2008). Among them, chi-square-based item fit indices (i.e., Q₁, G², S-X², and S-G²)
are the most popular family of statistical indices, and they have been used to examine model
misspecification with dichotomous and/or polytomous items (Chon et al., 2010; Kang & Chen,
2008; Yen, 1981; Liang & Wells, 2009; McKinley & Mills, 1985; Orlando & Thissen, 2000;
Wang et al., 2015). S-X² and Q₁ were used to test violation of the monotonicity assumption of
the item response function (IRF; Orlando & Thissen, 2003), or to test the item misfit due to
model misspecification or Q-matrix misspecification in latent class models (Wang et al., 2015).
Stone's χ²* and G²* were used to detect item parameter drift (LaHuis et al., 2011; Stone &
Zhang, 2003). Also, S-X² was used to identify the factor structure of the test, namely, to
differentiate simple from complex multidimensional structure (Zhang & Stone, 2008), or to
differentiate unidimensional from bifactor or multidimensional structure (Li & Rupp, 2011).

Furthermore, Haberman (2009) used generalized residuals to assess item fit based on t
statistics. Unlike the previous methods (i.e., chi-square-based statistics or generalized
residuals), which used different forms of discrepancy measures to quantify the discrepancy between
model prediction and observation, the Lagrange multiplier (LM) test (Glas & Suarez-Falcon,
2003) was proposed to identify item misfit due to violation of local independence. Under the
Bayesian framework, the posterior predictive model checking (PPMC) method (Sinharay, 2005,
2006) was also used to assess item fit. Different forms of discrepancy measures were considered
within the PPMC framework, such as Yen's Q₁ (Sinharay, 2005; Wang et al., 2015),
Stone's χ²* (Wang et al., 2015), and Orlando and Thissen's S-X² and S-G² (Sinharay, 2006).

Moreover, the root integrated squared error (RISE; Douglas & Cohen, 2001) was presented
to investigate the fit of parametric IRT models by comparing them with models fitted under
nonparametric assumptions. To test item misfit due to model misspecification, RISE
outperformed G² and S-X² in that it controlled Type I error rates and provided adequate power (Liang
& Wells, 2015; Liang, Wells, & Hambleton, 2014; Wells & Bolt, 2008).

In contrast, the limited information fit statistics (Bartholomew & Leung, 2002; Cai,
Maydeu-Olivares, Coffman, & Thissen, 2006; Maydeu-Olivares & Joe, 2005; Reiser, 2008)
use the marginal tables (i.e., the cross tabulations of item pairs or item triplets), rather than
the frequencies of single response patterns used in chi-square-based item fit statistics, to identify
misfit. These indices were often used to assess item fit in sparse contingency tables (Cai et al.,
2006; Maydeu-Olivares & Joe, 2005), detect violations of local independence (Liu & Maydeu-Olivares,
2013), and identify the source of misfit (Liu & Maydeu-Olivares, 2014).

While all above-cited literature focused on assessing item-level fit for nonhierarchical mod- 
els, the main focus of this article is to assess item-level fit for one kind of hierarchical models, 
that is, HO-IRT models. Li and Rupp (2011) is the only prior study that evaluated the 


646 Applied Psychological Measurement 42(8) 


performance of multivariate S-X² with the bifactor model, the multidimensional item response
theory (MIRT) model, and the unidimensional item response theory (UIRT) model. However,
their study is limited in the following aspects: (a) Only two dimensions were considered, and
the performance of S-X² with more than two dimensions was not examined; (b) the
performance of S-X² for comparing the bifactor model with the HO-IRT model was not mentioned;
and (c) the recursive algorithm which was needed to compute S-X² for hierarchical models was
not developed in full generality (Cai, 2015).

The primary goals of this study, thus, are (a) to investigate the performances of four
chi-square-based indices (Q₁, G², S-X², and S-G²) in detecting item misfit due to misspecified IRFs
where the items conform to the HO-IRT model and (b) to examine the performances of the
indices in detecting item misfit when misfit is induced by a different factor structure, such as a
bifactor structure or a correlated-factor structure.

The remainder of this article is organized as follows: First, the authors review the higher 
order IRT model. Second, they present the chi-square-based item fit indices for HO-IRT mod- 
els. Third, the simulation studies are provided to illustrate the performances of those indices. 
Finally, they end with some concluding remarks. 


Method 
Model Description 


This section introduces the notations and the model descriptions for the higher order IRT mod- 
els. Interested readers can refer to de la Torre and Song (2009) for a full description. 

Usually in the HO-IRT framework, a test is assumed to have a multi-unidimensional
structure, in which each item measures one domain-specific ability, and in total, there are T
domain-specific abilities. Then an overall ability is extracted from the T domain-specific abilities to
explain the common variance among them. The HO-IRT model consists of a measurement
model (e.g., the three-parameter logistic model expressed in Equation 1) and a higher order
dimensional structure (expressed in Equation 2). Mathematically, the model is specified as


$$P_{ijt} = P(y_{ijt} = 1 \mid a_i, b_i, c_i, \theta_{jt}) = c_i + \frac{1 - c_i}{1 + \exp[-D(a_i \theta_{jt} - b_i)]}, \tag{1}$$

$$\theta_{jt} = \lambda_t \xi_j + \epsilon_{jt}, \tag{2}$$

where $y_{ijt}$ is the response of examinee $j$ ($j = 1, \ldots, N$) on item $i$ ($i = 1, \ldots, I$), which measures
domain $t$ ($t = 1, \ldots, T$); $\xi_j$ and $\theta_{jt}$ are the overall and $t$th domain-specific abilities; $a_i$, $b_i$, and $c_i$
are the discrimination, threshold, and guessing parameters for item $i$, respectively; $D$ is a scaling
constant, which is set to 1.7; $\lambda_t$ is the latent regression coefficient of the domain-specific ability
$\theta_{jt}$, which derives from the correlation ($\rho$) between abilities (i.e., $\lambda_t = \sqrt{\rho}$); and lastly, $\epsilon_{jt}$ is the
residual term with respect to $\theta_{jt}$, conditioning on the overall ability $\xi_j$. For identification purposes,
$\xi_j \sim N(0, 1)$ and $\epsilon_{jt} \sim N(0, 1 - \lambda_t^2)$ so that both domain-specific abilities and overall abilities are
put on the same metric (de la Torre & Song, 2009), that is, the standard normal metric.
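
As a minimal sketch of Equations 1 and 2, the measurement model and the higher order structure can be written in Python as follows. The function names `irf_3pl` and `draw_domain_abilities`, the seed, and the illustrative parameter values are assumptions introduced here for illustration; they are not part of the original study.

```python
import math
import random

D = 1.7  # scaling constant in Equation 1


def irf_3pl(theta, a, b, c):
    """Equation 1: probability of a correct response under the 3PL measurement model."""
    return c + (1.0 - c) / (1.0 + math.exp(-D * (a * theta - b)))


def draw_domain_abilities(rho, n_domains, rng):
    """Equation 2: theta_t = lambda * xi + eps_t, with lambda = sqrt(rho).

    xi ~ N(0, 1) and eps_t ~ N(0, 1 - lambda^2), so each theta_t is standard normal.
    """
    lam = math.sqrt(rho)
    xi = rng.gauss(0.0, 1.0)
    resid_sd = math.sqrt(1.0 - lam ** 2)
    thetas = [lam * xi + rng.gauss(0.0, resid_sd) for _ in range(n_domains)]
    return xi, thetas


# Illustrative values only (hypothetical item with a = 1, b = 0, c = 0.2).
rng = random.Random(0)
xi, thetas = draw_domain_abilities(rho=0.7, n_domains=4, rng=rng)
probs = [irf_3pl(th, a=1.0, b=0.0, c=0.2) for th in thetas]
```

Because the residual variance is 1 − λ², the marginal variance of each domain-specific ability is λ² + (1 − λ²) = 1, which is what places the overall and domain-specific abilities on the same standard normal metric.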


Chi-Square-Based Item-Fit Indices 


According to Hambleton, Swaminathan, and Rogers (1991); Stone (2000); and Orlando and 
Thissen (2000), a common strategy for assessing item fit of an IRT model can be summarized 
as follows. (a) Estimate the model parameters (i.e., item parameters, ability parameters, coeffi- 
cients, and residuals in this study) from a dataset, (b) classify the examinees into K subgroups 




according to parameter estimates or test scores, (c) calculate the observed and predicted
proportions of correct/incorrect responses for each item and each subgroup, and (d) calculate
chi-square-based statistics by computing the discrepancy between observed and predicted values.
The following four item-fit indices all follow this set of steps for evaluating item-level fit. The
difference lies in how the discrepancy is calculated and how examinees are grouped.


The traditional chi-square-based fit indices. Both Q₁ and G² are considered to be traditional
chi-square-based fit indices. For both of them, examinees are rank-ordered and partitioned into 10
homogeneous subgroups according to their overall ability estimates, as 10 subgroups are
sufficient to produce a robust estimate of the discrepancy (Yen, 1981).

Yen's (1981) Q₁ for a dichotomous item $i$ has the form

$$Q_{1i} = \sum_{k=1}^{10} \frac{N_k (O_{ik} - E_{ik})^2}{E_{ik}} + \sum_{k=1}^{10} \frac{N_k [(1 - O_{ik}) - (1 - E_{ik})]^2}{1 - E_{ik}} = \sum_{k=1}^{10} \frac{N_k (O_{ik} - E_{ik})^2}{E_{ik}(1 - E_{ik})}, \tag{3}$$

where $k$ ($k = 1, \ldots, 10$) represents a homogeneous group of examinees, and $N_k$ is the number of
examinees in group $k$. The observed proportions ($O_{ik}$) are obtained by calculating the proportion
of examinees in group $k$ who answer item $i$ correctly. The expected proportions ($E_{ik}$) are
computed from the model as the mean predicted probability of a correct response in each interval.
Yen (1981) showed that the degrees of freedom (df) associated with Q₁ equal 10 − m, where m is
the number of model parameters excluding ability parameters. With the higher order IRT model,
note that m equals the number of item parameters plus 1. This additional parameter refers to the
loading from the second-order factor to the first-order factor; only one loading is considered
per item because the item displays the simple structure.

In addition to Q₁, McKinley and Mills (1985) constructed a likelihood ratio G² statistic
based on computations similar to those used for Q₁. The notation is the same as in Equation 3.
The correct/incorrect responses for each subgroup are tallied, and G² for item $i$ can be
computed as

$$G_i^2 = 2 \sum_{k=1}^{10} N_k \left[ O_{ik} \ln\frac{O_{ik}}{E_{ik}} + (1 - O_{ik}) \ln\frac{1 - O_{ik}}{1 - E_{ik}} \right]. \tag{4}$$

The df associated with G² also equals 10 − m.
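
The grouping and discrepancy steps behind Equations 3 and 4 can be sketched in Python as follows. This is an illustration under stated assumptions rather than the authors' implementation: the helper names are hypothetical, the 10 groups are formed as equal-sized rank-ordered blocks, and a cell with an observed proportion of 0 or 1 is assumed to contribute zero to the corresponding logarithmic term (the 0 ln 0 = 0 convention).

```python
import math


def group_by_ability(xi_hat, responses, predicted, n_groups=10):
    """Rank-order examinees by estimated overall ability and split them into
    n_groups blocks of (nearly) equal size (an assumption of this sketch).

    responses[j] is the 0/1 response of examinee j to the item and predicted[j]
    is the model-implied probability; requires len(xi_hat) >= n_groups.
    Returns (n, obs, exp): group sizes, observed and expected proportions correct.
    """
    order = sorted(range(len(xi_hat)), key=lambda j: xi_hat[j])
    n, obs, exp = [], [], []
    for g in range(n_groups):
        members = order[g * len(order) // n_groups:(g + 1) * len(order) // n_groups]
        n.append(len(members))
        obs.append(sum(responses[j] for j in members) / len(members))
        exp.append(sum(predicted[j] for j in members) / len(members))
    return n, obs, exp


def q1_and_g2(n, obs, exp):
    """Equations 3 and 4 for one item, given grouped proportions."""
    q1, g2 = 0.0, 0.0
    for n_k, o, e in zip(n, obs, exp):
        q1 += n_k * (o - e) ** 2 / (e * (1.0 - e))
        if o > 0.0:  # assumed convention: 0 * ln(0) = 0
            g2 += 2.0 * n_k * o * math.log(o / e)
        if o < 1.0:
            g2 += 2.0 * n_k * (1.0 - o) * math.log((1.0 - o) / (1.0 - e))
    return q1, g2


def df_traditional(n_groups, n_item_params):
    """df = number of groups - m, where m = item parameters + 1 (higher order loading)."""
    return n_groups - (n_item_params + 1)
```

For a three-parameter item, for example, `df_traditional(10, 3)` gives 10 − 4 = 6.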


Orlando and Thissen's S-X² index and S-G² index. One known problem with Q₁ and G² is that they
rely on the estimated $\xi$ for group assignment. Hence, if $\xi$ is not accurately estimated, examinees
will likely be misgrouped, leading to ill-behaved fit indices. Orlando and Thissen (2000)
proposed S-X² and S-G², which overcome this limitation. Instead of relying on the model-dependent $\xi$,
S-X² and S-G² rely on test scores (i.e., number-correct [NC] scores). In addition, the expected
correct proportions in both S-X² and S-G² are calculated from all response patterns rather
than from the predicted probability of a correct response.

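The key component in S-X² and S-G² is the expected correct proportion conditioning on
different total scores, and Orlando and Thissen's (2000) original idea can be extended to the
HO-IRT model and any similar hierarchical model. To be specific, the expected proportion of correct
responses for item $i$ and the $k$th group has the form

$$E_{ik} = \frac{\iint p_i(\boldsymbol{\theta})\, f^{*i}(k - 1 \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \xi)\, \phi(\xi)\, d\boldsymbol{\theta}\, d\xi}{\iint f(k \mid \boldsymbol{\theta})\, p(\boldsymbol{\theta} \mid \xi)\, \phi(\xi)\, d\boldsymbol{\theta}\, d\xi}, \tag{5}$$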




where $\boldsymbol{\theta}$ is the vector of domain-specific abilities, $\xi$ denotes the overall ability, $p_i(\boldsymbol{\theta})$ is the IRF
of item $i$, $f(k \mid \boldsymbol{\theta})$ is the NC score posterior distribution (Orlando & Thissen, 2000) for score
group $k$, $f^{*i}(k - 1 \mid \boldsymbol{\theta})$ is the NC score posterior distribution for score group $k - 1$ excluding item $i$,
$p(\boldsymbol{\theta} \mid \xi)$ is the conditional posterior distribution of domain-specific ability $\boldsymbol{\theta}$ given overall ability
$\xi$, and $\phi(\xi)$ represents the population distribution of $\xi$. In the original multivariate S-X² and
S-G², $p(\boldsymbol{\theta} \mid \xi)\phi(\xi)$ is combined as a prior distribution of latent traits, which leads to a T-fold
integral. As every item loads onto only one domain-specific dimension, the integral in Equation 5
reduces to a two-fold integral. A rectangular quadrature over equally spaced increments of $\theta$
and $\xi$ from −4.5 to 4.5 (Stroud, 1974) can be used to approximate the integral in Equation 5.

Cai's (2015) Lord-Wingersky algorithm version 2.0 is used to calculate the NC score
posterior distributions. Version 2.0 is an extension of Lord and Wingersky's (1984) original
algorithm to multiunidimensional structures, that is,

$$f(k \mid \boldsymbol{\theta}) = \sum_{s_1 + \cdots + s_T = k} \; \prod_{t=1}^{T} L(s_t \mid \theta_t), \tag{6}$$

where $L(s_t \mid \theta_t)$ is the NC score posterior distribution for the items which measure the $t$th
dimension and the NC score is $s_t$. $f^{*i}(k - 1 \mid \boldsymbol{\theta})$ can be calculated similarly to $f(k \mid \boldsymbol{\theta})$.

As this study focuses on dichotomous items, the NC score posterior distribution for the
$t$th dimension can be obtained by Lord and Wingersky's (1984) algorithm. For the first item
measuring the $t$th dimension, $L^*(0 \mid \theta_t) = 1 - p_1(1 \mid \theta_t)$ and $L^*(1 \mid \theta_t) = p_1(1 \mid \theta_t)$, where $p_1(1 \mid \theta_t)$ is the
probability of a correct response on the first item measuring the $t$th dimension and $L^*(s \mid \theta_t)$ is the
interim value for the NC score posterior distribution for score group $s$. Then add the second
item measuring the $t$th dimension: $L(0 \mid \theta_t) = L^*(0 \mid \theta_t)(1 - p_2(1 \mid \theta_t))$ and
$L(1 \mid \theta_t) = L^*(0 \mid \theta_t)\, p_2(1 \mid \theta_t) + L^*(1 \mid \theta_t)(1 - p_2(1 \mid \theta_t))$, where $p_2(\theta_t)$ is the IRF of the second item. After adding each item, the
new $L(s \mid \theta_t)$ replaces $L^*(s \mid \theta_t)$ for all scores computed for the previous item. Repeating this
recursive process yields the following equation:

$$L(s_t \mid \theta_t) = L^*(s_t - 1 \mid \theta_t)\, p_i(\theta_t) + L^*(s_t \mid \theta_t)\,(1 - p_i(\theta_t)), \tag{7}$$

where $p_i(\theta_t)$ is the IRF of the $i$th item. Below is a pseudo-algorithm that further details the
algorithm.


Cai's Lord-Wingersky Algorithm Version 2.0 for Calculating $f(k \mid \boldsymbol{\theta})$


Step 1: Calculate $L(s_t \mid \theta_t)$ for $s_t = 0, \ldots, I_t$ and $t = 1, \ldots, T$ using Equation 7, where $I_t$ is the
total number of items measuring the $t$th dimension.

Step 2: Construct a T-column matrix, $\mathbf{S}$, which consists of all the possible score patterns
given total score $k$. In this matrix, the sum of each row equals the NC score $k$. The $t$th
column of $\mathbf{S}$ consists of all the possible subscores of the $t$th dimension. Denote $S_{ht}$ as the
element in the $h$th row and the $t$th column.

Step 3: Calculate $\prod_{t=1}^{T} L(S_{ht} \mid \theta_t)$ for each $h$, where $L(S_{ht} \mid \theta_t)$ is known from Step 1; then
$f(k \mid \boldsymbol{\theta})$ can be obtained by summing all of them, as shown in Equation 6.
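
The three steps above can be sketched in Python as follows, assuming that the correct-response probabilities of each dimension's items have already been evaluated at a fixed ability vector. The function names are hypothetical, and the subscore patterns in Step 2 are enumerated by brute force for clarity, which is an assumption of this sketch rather than an efficient implementation.

```python
from itertools import product


def lord_wingersky(probs):
    """Equation 7: summed-score likelihoods L(s | theta_t) for s = 0..len(probs).

    probs holds the correct-response probabilities of one dimension's items,
    all evaluated at the same fixed theta_t.
    """
    like = [1.0]  # before any item is added, score 0 has probability 1
    for p in probs:
        new = [0.0] * (len(like) + 1)
        for s, val in enumerate(like):
            new[s] += val * (1.0 - p)  # incorrect response: score stays at s
            new[s + 1] += val * p      # correct response: score moves to s + 1
        like = new                     # the new L replaces the interim L*
    return like


def total_score_likelihood(k, probs_by_dim):
    """Equation 6: f(k | theta) summed over subscore patterns with s_1 + ... + s_T = k."""
    per_dim = [lord_wingersky(p) for p in probs_by_dim]        # Step 1
    total = 0.0
    for pattern in product(*[range(len(l)) for l in per_dim]):  # Step 2: candidate rows of S
        if sum(pattern) != k:
            continue
        term = 1.0
        for t, s_t in enumerate(pattern):                       # Step 3: product over dimensions
            term *= per_dim[t][s_t]
        total += term
    return total
```

As a small check, with two items of probability 0.5 on the first dimension and one such item on the second, `total_score_likelihood(1, [[0.5, 0.5], [0.5]])` returns 0.375, the probability of exactly one success in three fair trials. The version excluding item i, $f^{*i}(k - 1 \mid \boldsymbol{\theta})$, follows by dropping that item's probability from its dimension before calling the same function with k − 1.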


Given $E_{ik}$ computed from Equation 5, the expressions of S-X² and S-G² for a dichotomous item $i$
on an $I$-item test are as follows:




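$$S\text{-}X_i^2 = \sum_{k=1}^{I-1} N_k \frac{(O_{ik} - E_{ik})^2}{E_{ik}(1 - E_{ik})}, \tag{8}$$

$$S\text{-}G_i^2 = 2 \sum_{k=1}^{I-1} N_k \left[ O_{ik} \ln\frac{O_{ik}}{E_{ik}} + (1 - O_{ik}) \ln\frac{1 - O_{ik}}{1 - E_{ik}} \right]. \tag{9}$$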


For an $I$-item test, based on total score, $(I + 1)$ subgroups are naturally formed. At the two
extremes, when the NC score equals 0 (or $I$), the proportion of examinees who answered item $i$
correctly is always 0 (or 1). Therefore, the expected probabilities for only $(I - 1)$ subgroups need
to be calculated. In Equations 8 and 9, $k$ ($k = 1, \ldots, I - 1$) indexes the score group, $N_k$ and $O_{ik}$
represent the number of examinees in group $k$ and the observed proportion correct for item $i$ in
group $k$, respectively, and $E_{ik}$ is the expected proportion correct for item $i$ in group $k$. The df
associated with S-X² and S-G² both equal $I - 1 - m$, where $m$ is the number of model parameters
excluding ability parameters.
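
Given expected proportions from Equation 5, the score-based grouping and Equations 8 and 9 can be sketched as follows. The function names are hypothetical; skipping empty score groups and treating 0 ln 0 as 0 are assumptions of this sketch, not procedures described in the study.

```python
import math


def observed_by_score(responses, item):
    """Group examinees by number-correct (NC) score and return (n, obs) for k = 1..I-1.

    responses is a list of 0/1 lists, one per examinee; item is a 0-based index.
    obs[k-1] is None when no examinee obtained score k.
    """
    n_items = len(responses[0])
    counts = [0] * (n_items + 1)
    correct = [0] * (n_items + 1)
    for row in responses:
        k = sum(row)
        counts[k] += 1
        correct[k] += row[item]
    n, obs = [], []
    for k in range(1, n_items):  # NC scores 0 and I are excluded
        n.append(counts[k])
        obs.append(correct[k] / counts[k] if counts[k] else None)
    return n, obs


def s_x2_and_s_g2(n, obs, exp):
    """Equations 8 and 9 for one item."""
    s_x2, s_g2 = 0.0, 0.0
    for n_k, o, e in zip(n, obs, exp):
        if n_k == 0:
            continue  # assumption: empty score groups contribute nothing
        s_x2 += n_k * (o - e) ** 2 / (e * (1.0 - e))
        if o > 0.0:  # assumed convention: 0 * ln(0) = 0
            s_g2 += 2.0 * n_k * o * math.log(o / e)
        if o < 1.0:
            s_g2 += 2.0 * n_k * (1.0 - o) * math.log((1.0 - o) / (1.0 - e))
    return s_x2, s_g2


def df_score_based(n_items, m):
    """df = I - 1 - m, with m the number of model parameters excluding abilities."""
    return n_items - 1 - m
```

For a 40-item test calibrated with a two-parameter higher order model (m = 2 + 1), `df_score_based(40, 3)` gives 36.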


Simulation Studies 


The main purpose of the simulation studies is to examine the performances of chi-square-based
fit indices for HO-IRT models, including the Q₁, G², S-X², and S-G² indices. Two different
sources of misfit are considered: The first type of misfit relies on different functional forms 
assumed for the item characteristic curve (ICC), and the second type of misfit is due to different 
multivariate latent trait structures, that is, higher order structure, bifactor structure, and simple 
multidimensional structure. To this end, the simulation studies are divided into two parts, each 
one focused on one kind of misfit. 

Table 1 shows the summary of data generation and calibration models for both studies. To
assess the power of the indices, the response data are generated from a mixture of two models
(denoted as Model A/Model B). More details will be given in the simulation design section.

For all conditions, the Markov chain Monte Carlo (MCMC) algorithm (de la Torre & Hong, 
2010; de la Torre & Song, 2009) was used to estimate parameters, and 100 replications were 
conducted per condition. In each Markov chain, there were 10,000 iterations, half as burn-in 
phase and half as sampling phase. False positive rates (FPRs) and correct detection rates 
(CDRs; Wang et al., 2015) were calculated for each condition to evaluate the performances of 
those indices. 

To account for the sampling error associated with the expected rejection rates, as Zhang and Stone
(2008) suggested, 95% confidence intervals (CIs) for the true rejection rates were reported:

$$CI_{95\%} = \alpha \pm 1.96 \times \left[ \frac{\alpha (1 - \alpha)}{R} \right]^{1/2},$$

where $R$ is the number of replications ($R$ equals 100 in this study) and $\alpha$ is the significance level
(which is set to .05 in this study). Hence, the expected 95% CI is [0.01, 0.09].
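
The interval can be verified numerically with a short Python sketch; the function name `expected_rejection_ci` is a hypothetical label used here for illustration.

```python
import math


def expected_rejection_ci(alpha, n_reps, z=1.96):
    """Normal-approximation interval for a rejection rate with true level alpha
    estimated from n_reps replications."""
    half_width = z * math.sqrt(alpha * (1.0 - alpha) / n_reps)
    return alpha - half_width, alpha + half_width


low, high = expected_rejection_ci(alpha=0.05, n_reps=100)
print(round(low, 2), round(high, 2))  # 0.01 0.09
```

With α = .05 and R = 100, the half-width is 1.96 × (.05 × .95 / 100)^{1/2} ≈ .043, so the unrounded bounds are approximately .007 and .093.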


Simulation Design 1


Study 1 was designed to examine the performances of chi-square-based fit indices for HO-IRT 
models to detect the misfit relying on different ICCs. Five factors and their varied conditions 
were considered: (a) generation model, three different combinations of models with different 
mixed ratios (i.e., misfit proportions); (b) test structure (de la Torre & Hong, 2010), 40 items
measured four dimensions equally, 40 items measured two dimensions equally, or 20 items 




Table 1. Summary of Data Generation and Calibration Models.

Study 1
                     Generation model
Calibration model    1PHO   2PHO   3PHO   2PHO/3PHO   1PHO/3PHO   1PHO/2PHO
1PHO                 FPR                              FPR&CDR     FPR&CDR
2PHO                        FPR           FPR&CDR
3PHO                               FPR

Study 2
                     Generation model
Calibration model    2PHO/bifactor   M2PLM
2PHO                 FPR&CDR         FPR

Note. FPR = false positive rate; CDR = correct detection rate; 1PHO = one-parameter higher order IRT model;
2PHO = two-parameter higher order IRT model; 3PHO = three-parameter higher order IRT model; M2PLM =
multidimensional two-parameter logistic model; 2PHO/3PHO = (1 − proportion) × 2PHO + proportion × 3PHO;
1PHO/3PHO = (1 − proportion) × 1PHO + proportion × 3PHO; 1PHO/2PHO = (1 − proportion) × 1PHO
+ proportion × 2PHO; 2PHO/bifactor = (1 − proportion) × 2PHO + proportion × bifactor model;
2PHO/M2PLM = (1 − proportion) × 2PHO + proportion × M2PLM.


measured two dimensions equally; (c) correlation (ρ) between the domains (de la Torre &
Hong, 2010), 0.5 (small), 0.7 (medium), or 0.9 (large); (d) sample size (N), 1,000 (medium) or
2,000 (large); and (e) misfit proportions (Wang et al., 2015), 0.1 (small), 0.2 (medium), and 0.4
(large), that is, the proportion of misfitting items per domain. The misfitting items were spread
equally across multiple dimensions. In total, 162 different conditions (3 generation models × 3 test
structures × 3 correlations × 2 sample sizes × 3 proportions) were simulated.
To generate the response data, the discrimination parameters for each dimension were drawn
from logN(0, 0.5), the difficulty parameters were drawn from a standard normal
distribution, and the guessing parameters were generated from Beta(8, 32). The overall abilities were
simulated from a standard normal distribution.
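
The design grid and the parameter-generating distributions can be sketched in Python as follows. Two points are assumptions of this sketch: the second argument of logN(0, 0.5) is read as the variance on the log scale (so the log-scale standard deviation is √0.5), and the seed and function name are arbitrary illustrative choices.

```python
import itertools
import math
import random

# Factors of Study 1 as listed in the text.
generation_models = ["3PHO/2PHO", "3PHO/1PHO", "2PHO/1PHO"]
test_structures = [(40, 4), (40, 2), (20, 2)]  # (number of items, number of dimensions)
correlations = [0.5, 0.7, 0.9]
sample_sizes = [1000, 2000]
misfit_proportions = [0.1, 0.2, 0.4]

conditions = list(itertools.product(
    generation_models, test_structures, correlations, sample_sizes, misfit_proportions))
print(len(conditions))  # 162


def draw_item_parameters(n_items, rng, log_sd=math.sqrt(0.5)):
    """Draw (a, b, c) per item: a ~ logN(0, 0.5), b ~ N(0, 1), c ~ Beta(8, 32).

    log_sd = sqrt(0.5) assumes that 0.5 denotes the log-scale variance.
    """
    items = []
    for _ in range(n_items):
        a = rng.lognormvariate(0.0, log_sd)
        b = rng.gauss(0.0, 1.0)
        c = rng.betavariate(8.0, 32.0)
        items.append((a, b, c))
    return items


rng = random.Random(2018)  # arbitrary seed for reproducibility
items = draw_item_parameters(40, rng)
```

The Beta(8, 32) distribution has mean 8/(8 + 32) = 0.2, so the generated guessing parameters are centered near the chance level of a five-option multiple-choice item.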


Simulation Results 1


FPRs. Tables 2 and 3 display the FPRs for the different conditions. As the FPRs of both
Q₁ and G² were much larger than those of the other two indices, the authors present only the
comparison between S-X² and S-G² and highlight the values that exceeded the expected 95% CI;
the full results are provided as a supplementary file. S-G² had smaller FPRs except for the 3PHO/1PHO
combination; for the other conditions, the misfit proportion and the correlation level
had nearly no effect, but a larger sample size led to higher FPRs. For the 3PHO/1PHO combination,
when the misfit proportion was 40%, S-G² had inflated FPRs, and this inflation was reduced by a
smaller correlation and a smaller sample size. This is because both S-X² and S-G² are based on total
score for grouping; hence, when test length increases, the total number of groups also increases.
As a result, the frequency table becomes larger, and the number of small observed/expected
frequencies increases; small frequencies are known to have a larger impact on S-G² than on S-X²
(Fienberg, 1979), which leads to much higher FPRs for S-G² than for S-X². Furthermore, when I/T
(test length/number of domains) was 40/2 or 20/2, this inflation also appeared for the 20% misfit
proportion. On the contrary, when the generation model was a single model, S-G² also had smaller FPRs,




Table 2. False Positive Rates Under GM and CM Identical (GM = CM) in Study 1.

                        40/4                    40/2                    20/2
N      ρ   Index   1PHO   2PHO   3PHO    1PHO   2PHO   3PHO    1PHO   2PHO   3PHO
1,000  .9  S-X²    .1073  .0850  .1105   .1080  .0897  .1263   .0914  .0810  .1495
           S-G²    .0417  .0412  .0845   .0405  .0405  .0917   .0714  .0655  .1185
       .7  S-X²    .1022  .0910  .1095   .1040  .0910  .1243   .0910  .0940  .1580
           S-G²    .0395  .0412  .0772   .0355  .0412  .0882   .0540  .0730  .1250
       .5  S-X²    .1025  .0933  .1235   .1165  .0915  .1515   .1014  .1165  .1570
           S-G²    .0403  .0420  .0955   .0393  .0362  .1077   .0640  .0760  .1280
2,000  .9  S-X²    .0985  .0803  .1195   .1028  .0910  .1397   .0880  .0875  .1430
           S-G²    .0502  .0422  .0917   .0493  .0400  .1073   .0655  .0700  .1150
       .7  S-X²    .0973  .0945  .1245   .0943  .0973  .1467   .0960  .1030  .1655
           S-G²    .0442  .0442  .0865   .0447  .0483  .1077   .0755  .0865  .1445
       .5  S-X²    .0985  .0922  .1230   .1127  .1097  .1472   .0985  .1270  .1650
           S-G²    .0367  .0445  .0943   .0522  .0510  .1065   .0760  .1010  .1505

Note. The bold values denote the minimal values under each condition. GM = Generation model; CM = Calibration
model; 1PHO = one-parameter higher order IRT model; 2PHO = two-parameter higher order IRT model; 3PHO =
three-parameter higher order IRT model; CI = confidence interval.


and the performance of S-X² was similar to that reported by Orlando and Thissen (2000) when I/T
was 40/4 and by Li and Rupp (2011) when I/T was 40/2. The FPRs for 3PHO were larger than those
for the other two models. Across Tables 2 and 3, S-G² had smaller FPRs than S-X², but the
performance of S-X² was more consistent. The influence of sample size on S-G² was more notable
than that on S-X². The FPRs were the largest when I/T was 20/2 and the smallest when I/T was 40/4.
In other words, for the same test length, the more domain-specific abilities measured, the smaller
the FPRs were; when the number of dimensions was fixed, a longer test led to smaller FPRs; for the
same number of items per domain, increasing the number of dimensions reduced the FPRs. On the whole,
S-G² performed better in terms of FPRs because more of the S-X² conditions exceeded
the expected 95% CI, although S-X² performed more consistently across the manipulated conditions.


CDRs. Table 4 provides the CDRs for the different conditions. The CDRs of Q₁ and G²
were not useful because of their inflated FPRs. Hereafter, the performances of S-X² and S-G²
are compared. As shown in Table 4, larger sample size, higher misfit proportion, and lower
correlation all led to higher CDR. For the 3PHO/2PHO combination, the CDRs of S-X² were larger
than those of S-G², but the CDRs of both indices were too low to detect misfit. For the 3PHO/1PHO
combination, S-X² also had larger CDRs than S-G², and the values were the largest among these
three combinations. For the 2PHO/1PHO combination, S-X² and S-G² performed similarly in detecting
misfit.

Comparing the results of all the conditions, the CDR was the largest when I/T was 40/2 and
the smallest when I/T was 40/4. This implies that, when the number of dimensions is fixed, a longer
test leads to higher CDR. In contrast, for the same test length, reducing the number of
dimensions helps increase the CDR. Furthermore, for a fixed number of items per domain,
fewer dimensions led to higher CDR. It appears that chi-square-based indices cannot be used to
detect the sole influence of the guessing parameter (due to the low CDR in differentiating 2PHO from
3PHO); nevertheless, they are more sensitive in detecting misfit due to misspecification of
discrimination parameters. When the two models differ in both discrimination and
guessing parameters, the CDR is the highest.




Table 3. False Positive Rates Under GM More Complex Than CM (GM > CM) in Study 1.

                                                       GM > CM
                             3PHO/2PHO               3PHO/1PHO               2PHO/1PHO
I/D   N      ρ   Index   10%    20%    40%     10%    20%    40%     10%    20%    40%
40/4  1,000  .9  S-X²    .0794  .0841  .0896   .0703  .0597  .0446   .1014  .1016  .0988
                 S-G²    .0375  .0366  .0388   .0361  .0500  .1029   .0439  .0397  .0450
             .7  S-X²    .0936  .0975  .0883   .0797  .0597  .0413   .1053  .1025  .0908
                 S-G²    .0367  .0338  .0400   .0339  .0425  .0696   .0358  .0425  .0292
             .5  S-X²    .0861  .0856  .1008   .0775  .0653  .0504   .0975  .0947  .0925
                 S-G²    .0378  .0331  .0517   .0333  .0341  .0596   .0303  .0338  .0400
      2,000  .9  S-X²    .0847  .0850  .0883   .0719  .0534  .0892   .0967  .1003  .0879
                 S-G²    .0431  .0441  .0458   .0603  .0856  .2321   .0514  .0591  .0521
             .7  S-X²    .0875  .0906  .1013   .0728  .0634  .0800   .0922  .0878  .0904
                 S-G²    .0406  .0428  .0458   .0425  .0653  .1721   .0442  .0444  .0483
             .5  S-X²    .0897  .0906  .0963   .0797  .0609  .0704   .0994  .0944  .0887
                 S-G²    .0436  .0437  .0471   .0436  .0478  .1113   .0364  .0362  .0404
40/2  1,000  .9  S-X²    .0881  .0831  .0917   .0683  .0531  .0454   .1072  .1103  .1033
                 S-G²    .0414  .0388  .0367   .0414  .0512  .1017   .0461  .0425  .0475
             .7  S-X²    .1017  .0956  .1017   .0811  .0619  .0517   .1097  .1179  .0908
                 S-G²    .0394  .0344  .0425   .0369  .0531  .1058   .0417  .0403  .0292
             .5  S-X²    .1056  .1013  .1246   .0786  .0703  .0487   .1086  .1069  .1042
                 S-G²    .0461  .0406  .0512   .0319  .0434  .0696   .0447  .0425  .0421
      2,000  .9  S-X²    .0869  .0941  .1025   .0664  .0622  .0908   .1042  .1044  .0875
                 S-G²    .0461  .0450  .0604   .0517  .0953  .2450   .0611  .0625  .0579
             .7  S-X²    .0917  .1084  .1088   .0750  .0625  .0858   .1017  .0953  .0983
                 S-G²    .0461  .0525  .0629   .0586  .0794  .1979   .0561  .0547  .0600
             .5  S-X²    .1092  .1088  .1213   .0819  .0728  .0867   .1244  .0988  .0975
                 S-G²    .0539  .0525  .0642   .0533  .0744  .1713   .0553  .0462  .0554
20/2  1,000  .9  S-X²    .0933  .0869  .0867   .0761  .0656  .0917   .1022  .1044  .0958
                 S-G²    .0694  .0537  .0575   .0750  .1019  .2075   .0689  .0894  .0800
             .7  S-X²    .0939  .1025  .0950   .0817  .0706  .0800   .0789  .0869  .1108
                 S-G²    .0767  .0719  .0592   .0789  .0925  .1800   .0617  .0663  .0900
             .5  S-X²    .1039  .1138  .1058   .0778  .0781  .0817   .1078  .1025  .1158
                 S-G²    .0750  .0831  .0767   .0700  .0856  .1467   .0750  .0681  .0867
      2,000  .9  S-X²    .1078  .1031  .1108   .0856  .1163  .2625   .0917  .1056  .1300
                 S-G²    .0750  .0819  .0967   .1039  .1719  .3858   .0772  .1019  .1308
             .7  S-X²    [illegible]  .1019  .1333   .0917  .0969  .2292   .0917  .1281  .1125
                 S-G²    .0906  .0719  .1092   .1011  .1644  .3642   .0828  .1000  .1133
             .5  S-X²    .1156  .1225  .1283   .0872  .1100  .2125   .0950  .1100  .1325
                 S-G²    .0844  .0906  .0992   .0950  .1300  .3150   .0861  .0981  .1333

Note. The bold values denote the minimal values under each condition. GM = Generation model; CM = Calibration
model; 3PHO = three-parameter higher order IRT model; 2PHO = two-parameter higher order IRT model; 1PHO =
one-parameter higher order IRT model; CI = confidence interval.


Simulation Design 2 


Study 2 was designed to further investigate the performances of S-X² and S-G², which were
considered acceptable in Study 1, in the context of detecting items conforming to a different factor
structure. Table 5 shows the different factor structures that were considered in this study. The
HO-IRT model is a special version of the bifactor model, which adds the proportionality
constraints on the general factor and group factor discrimination parameters for each domain, so
that the HO-IRT model is nested within the bifactor model. Also, the HO-IRT model can be 




Table 4. Correct Detection Rates Under GM More Complex Than CM (GM > CM) in Study I. 


                               3PHO/2PHO               3PHO/1PHO               2PHO/1PHO
Items/D  N      ρ   Index   10%    20%    40%      10%    20%    40%      10%    20%    40%

40/4     1,000  .9  S-X²   .2500  .2587  .2094    .7750  .6700  .6169    .4400  .4838  .4531
                    S-G²   .1950  .1750  .1494    .7225  .6125  .5919    .4400  .4700  .4625
                .7  S-X²   .1850  .1600  .1600    .6375  .5587  .5331    .4250  .3787  .3669
                    S-G²   .1300  .0925  .0988    .5750  .4475  .4931    .4200  .3950  .3750
                .5  S-X²   .1575  .1350  .1269    .6200  .5475  .4456    .3250  .2913  .3187
                    S-G²   .0825  .0862  .0819    .5100  .4188  .3794    .2825  .3013  .3088
         2,000  .9  S-X²   .4025  .3575  .2775    .8250  .7863  .7512    .5850  .6088  .6062
                    S-G²   .3500  .3125  .2200    .8150  .7750  .7406    .6050  .6150  .6150
                .7  S-X²   .2875  .2562  .2025    .7375  .7163  .6719    .5700  .5363  .5594
                    S-G²   .2700  .1888  .1581    .7150  .6613  .6462    .5650  .5400  .5594
                .5  S-X²   .1675  .1812  .1812    .6900  .6288  .5837    .4225  .4338  .4537
                    S-G²   .1050  .1300  .1138    .6025  .5863  .5706    .4450  .4238  .4400
40/2     1,000  .9  S-X²   .2650  .2488  .2238    .7425  .7100  .6194    .5175  .4875  .4894
                    S-G²   .1925  .1950  .1525    .6600  .6613  .5837    .4825  .5038  .4894
                .7  S-X²   .2525  .2188  .1800    .7100  .6538  .5813    .4275  .4400  .3669
                    S-G²   .1925  .1375  .1156    .6225  .5813  .5319    .4600  .4350  .3750
                .5  S-X²   .1775  .1863  .1700    .6425  .6275  .5081    .3550  .3912  .3688
                    S-G²   .1300  .1075  .1019    .5500  .5525  .4631    .3150  .3937  .3569
         2,000  .9  S-X²   .3850  .3675  .3113    .8475  .8063  .7712    .6175  .6375  .6125
                    S-G²   .3225  .3225  .2575    .8225  .7837  .7581    .6325  .6600  .6219
                .7  S-X²   .3175  .3250  .2581    .8200  .7588  .7244    .5475  .5713  .5869
                    S-G²   .2450  .2675  .2050    .7800  .7087  .6950    .5275  .6012  .6094
                .5  S-X²   .2800  .2600  .1944    .7650  .7300  .6675    .5350  .5138  .5406
                    S-G²   .2175  .1913  .1400    .7475  .7025  .6525    .5150  .5050  .5419
20/2     1,000  .9  S-X²   .3200  .2425  .2188    .7500  .7675  .6875    .5300  .5475  .5550
                    S-G²   .2950  .2125  .2062    .7200  .7400  .6637    .5450  .5625  .5463
                .7  S-X²   .2550  .2125  .2037    .7400  .6775  .6200    .5500  .5200  .5313
                    S-G²   .2250  .1800  .1737    .6350  .6625  .5888    .5550  .5225  .5425
                .5  S-X²   .1850  .1450  .1750    .5950  .6325  .5663    .4850  .4550  .4500
                    S-G²   .1600  .1125  .1425    .5250  .6000  .5700    .4950  .4875  .4663
         2,000  .9  S-X²   .3950  .3525  .2950    .8350  .8200  .7837    .6150  .6925  .6725
                    S-G²   .3700  .3550  .2900    .8450  .8025  .7950    .6300  .6750  .6863
                .7  S-X²   .3600  .3000  .2600    .7700  .8275  .7588    .6150  .6200  .6550
                    S-G²   .3450  .2625  .2425    .7550  .8225  .7800    .6500  .6500  .6713
                .5  S-X²   .2800  .2375  .2200    .8050  .7375  .6900    .6150  .6575  .6025
                    S-G²   .2350  .2300  .2025    .7850  .7300  .6937    .6100  .6650  .6050


Note. The bold values denote the maximum values under each condition. GM = Generation model; CM = Calibration
model; 3PHO = three-parameter higher order IRT model; 2PHO = two-parameter higher order IRT model; 1PHO =
one-parameter higher order IRT model.


considered nested within the simple-structure MIRT model when the number of dimensions, D, 
exceeds 3; when D = 3, the two models will give exactly the same fit. 
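
The nesting claim can be illustrated with a parameter count. The following is only a sketch, assuming the usual higher order parameterization in which each standardized domain ability is a loading times the overall ability plus an independent residual; the symbol \lambda_d for the loading of domain d is introduced here for illustration and is not taken from this section.

```latex
% Assumed structure: \theta^{(d)} = \lambda_d \theta + \varepsilon_d, with unit-variance \theta^{(d)}
\operatorname{Corr}\bigl(\theta^{(d)}, \theta^{(d')}\bigr) = \lambda_d \lambda_{d'}, \qquad d \neq d'.

% Free structural parameters in each model
\text{HO-IRT: } D \text{ loadings}, \qquad
\text{MIRT: } \frac{D(D-1)}{2} \text{ correlations}.

% D = 3: 3 loadings vs. 3 correlations (same fit)
% D = 4: 4 loadings vs. 6 correlations (HO-IRT is a restricted MIRT model)
```

Under this assumption, the higher order structure restricts the correlation matrix only when D(D-1)/2 > D, that is, when D exceeds 3.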

Based on the results of Study 1, only one sample size (N = 1,000) and one test structure
(40 items measuring four dimensions equally) were considered. Ten conditions were simulated
(1 MIRT condition + 9 bifactor conditions: 3 correlations × 3 proportions). The difference
between the MIRT and HO-IRT models emerges only in the factor covariance matrix, and the
item parameters of both models have the same meanings. Hence, if the data were generated
from the MIRT model and fitted with the HO-IRT model, one would expect the detection rates
of S-X² and S-G² to be close to the nominal


Table 5. Summary of Comparison Among Different Multidimensional Structures.

[Table 5 compares the HO-IRT, MIRT, and bifactor models by test structure, formulation, latent traits structure, notation, and number of parameters.]

Note. Take the four-dimension structure as an example. HO-IRT = higher order item response theory; MIRT = multidimensional item response theory.



Table 6. Comparison Between S-X² and S-G² When the Misspecifying Items Have a Bifactor Structure.


                 10%              20%              40%
ρ   Index    FPR    CDR       FPR    CDR       FPR    CDR
.9  S-X²    .0847  .1100     .0894  .0825     .0950  .0869
    S-G²    .0397  .0525     .0453  .0362     .0454  .0400
.7  S-X²    .0919  .0700     .1016  .1225     .1021  .1256
    S-G²    .0436  .0300     .0494  .0650     .0471  .0737
.5  S-X²    .0942  .1400     .1166  .1375     .1225  .1431
    S-G²    .0400  .1600     .0537  .1388     .0579  .1275


Note. FPR = false positive rate; CDR = correct detection rate. 


level. Therefore, in this case, the proportion of misfit was not manipulated; instead, all
items were assumed to conform to the MIRT model. This provides an additional sanity check on the
proposed indices. To generate the response data using the MIRT model, the discrimination
parameters were drawn from logN(0, 0.5), the difficulty parameters were drawn from a standard
normal distribution, and the correlations between any two dimensions were set to 0.5.

In contrast, the difference between the bifactor and HO-IRT models can show up at the
item level. Therefore, when the misfitting items had a bifactor structure, mixed models with
three different mixing ratios were used as the data generation model. The correlations (ρ)
and misfit proportions were the same as those in Study 1. To generate the response
data, the discrimination parameters were drawn from logN(0, 0.5), and the difficulty
parameters were drawn from a standard normal distribution. For the misfitting items
conforming to the bifactor structure, the group discrimination parameter was
regenerated from logN(0, 0.5), and the general discrimination parameter was set equal to the
original discrimination parameter in the HO-IRT model. The abilities for each domain were
generated from a standard normal distribution.
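
The data generation just described might be sketched as follows. This is a minimal illustration rather than the authors' code, and several details are assumptions made here: the second argument of logN(0, 0.5) is treated as the log-scale variance, the scaling constant is set to 1, the domain abilities follow a higher order structure with loading sqrt(ρ), and the group ability of a bifactor item is taken to be the domain residual. The function name `simulate_mixed_data` and its arguments are hypothetical.

```python
import numpy as np


def simulate_mixed_data(n_persons=1000, n_items=40, n_dims=4,
                        rho=0.5, misfit_prop=0.2, seed=0):
    """Simulate 0/1 responses from a mix of HO-IRT and bifactor items."""
    rng = np.random.default_rng(seed)
    # Items are split equally across domains (n_items divisible by n_dims).
    dim_of_item = np.repeat(np.arange(n_dims), n_items // n_dims)

    # Assumed higher order structure: theta_d = lam * theta_0 + residual,
    # with lam = sqrt(rho) so every pair of domains correlates at rho.
    lam = np.sqrt(rho)
    theta0 = rng.standard_normal(n_persons)
    resid = rng.standard_normal((n_persons, n_dims))
    theta_d = lam * theta0[:, None] + np.sqrt(1.0 - lam ** 2) * resid

    # logN(0, 0.5): 0.5 is treated as the log-scale variance (assumption).
    log_sd = np.sqrt(0.5)
    a = rng.lognormal(0.0, log_sd, n_items)        # discrimination
    b = rng.standard_normal(n_items)               # difficulty
    a_group = rng.lognormal(0.0, log_sd, n_items)  # used by bifactor items only

    # Randomly choose which items follow the bifactor structure.
    n_misfit = int(round(misfit_prop * n_items))
    is_misfit = np.zeros(n_items, dtype=bool)
    is_misfit[rng.choice(n_items, n_misfit, replace=False)] = True

    # HO-IRT items: logit = a * (theta_d - b)
    ho_logit = a * (theta_d[:, dim_of_item] - b)
    # Bifactor items: general discrimination equals the original a.
    bif_logit = a * theta0[:, None] + a_group * resid[:, dim_of_item] - b
    logit = np.where(is_misfit, bif_logit, ho_logit)

    prob = 1.0 / (1.0 + np.exp(-logit))
    responses = (rng.random((n_persons, n_items)) < prob).astype(int)
    return responses, is_misfit
```

Calling `simulate_mixed_data(rho=0.7, misfit_prop=0.4)` would give one replication for the condition with ρ = .7 and 40% misfitting items, together with the indicator of which items are truly misfitting.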


Simulation Results 2 


When the misfitting items were generated from the bifactor model (Table 6), S-G² had smaller
FPRs than S-X² under all conditions, which was consistent with Study 1, but the manipulated
factors did not have a consistent effect on the FPRs. Regarding CDR, S-X² had larger CDRs
than S-G² except under the lowest correlation (ρ = .5) with misfit proportions of 10% and 20%.
As one would expect, a higher correlation brings both the bifactor and the HO-IRT models
close to a UIRT model, and the distinction between them becomes so small that the CDR is
low. A higher misfit proportion also led to smaller CDR. One explanation is that, when there is
a large proportion of misfitting items, the item parameter estimates are likely to be biased by
the contamination from the misfitting items, and item-level misfit detection therefore becomes
more difficult. In fact, this is a known disadvantage of almost all residual-based (or
discrepancy-based) fit indices. CDR was relatively low across all conditions, with values ranging
from .070 to .143 for S-X² and from .030 to .160 for S-G². This is not surprising because, when
the HO-IRT model is fitted to data simulated from a combination of the HO-IRT and bifactor models,
recovery for the test as a whole is acceptable even though the misfitting items cannot be recovered well.
Indeed, the observed results further support that these indices would not have inflated FPRs.
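
For readers who want to reproduce rates like those in Table 6, the following sketch shows one way FPR and CDR could be computed from item-fit p-values. It assumes a nominal level of .05 and the same set of misfitting items in every replication; neither is stated in this section, and the function name `fpr_and_cdr` is hypothetical.

```python
import numpy as np


def fpr_and_cdr(p_values, is_misfit, alpha=0.05):
    """Return (FPR, CDR) from item-fit p-values.

    p_values  : array (n_replications, n_items) of item-fit p-values
    is_misfit : boolean array (n_items,), True for truly misfitting items
    alpha     : nominal significance level (assumed here to be .05)
    """
    flagged = np.asarray(p_values) < alpha
    is_misfit = np.asarray(is_misfit, dtype=bool)
    fpr = flagged[:, ~is_misfit].mean()  # fitting items flagged by mistake
    cdr = flagged[:, is_misfit].mean()   # misfitting items correctly flagged
    return fpr, cdr
```

Both rates are averaged over items and replications, so a well-behaved index should give an FPR near `alpha` and a CDR as far above it as possible.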




As a reference, when the response data were generated solely from the bifactor model and
fitted with the HO-IRT model (i.e., the misfit proportion was 100%), the CDRs of S-X² and
S-G² were .285 and .257, respectively. These values are higher than those reported by Li and
Rupp (2011), who examined the detection of item-level misfit when data were generated from the
bifactor model and fitted with the MIRT model.

When the response data were generated from the MIRT model and the HO-IRT model was
used to fit the data (i.e., the misfit proportion was 100%), the CDRs of S-X² and S-G² were
.147 and .098, respectively, which was consistent with Li and Rupp (2011). This low
power is not unexpected, however, because the parameter estimates from the “misfitting”
HO-IRT model were actually close to the true MIRT model parameters. In particular, the root mean
square error between the estimated and true domain-specific abilities ranged from .448
to .498, and the bias ranged from −.026 to −.009. Item parameter recovery was also
acceptable. Given the close resemblance between the HO-IRT and MIRT models, the low
CDR is actually reassuring because it implies that the FPR is well controlled.
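
The recovery summaries quoted above can be computed as in the short sketch below. It assumes the common convention that bias is the mean of estimate minus true value, which this section does not spell out; the function name `rmse_and_bias` is hypothetical.

```python
import numpy as np


def rmse_and_bias(estimated, true):
    """Return (RMSE, bias) between estimated and true values."""
    err = np.asarray(estimated, dtype=float) - np.asarray(true, dtype=float)
    rmse = np.sqrt(np.mean(err ** 2))  # root mean square error
    bias = np.mean(err)                # mean signed error (estimate - true)
    return rmse, bias
```

Applied separately to each domain's ability estimates, this yields one RMSE and one bias value per domain, which is how a range of values across domains would arise.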


Discussion 


Before any model-based inferences can be drawn, the model's fit must be thoroughly assessed,
because conclusions derived from poorly fitting models may be misleading. In
practice, item fit is not analyzed in isolation. Model-data fit at the global level
must first be investigated using model fit indices; when a model does not fit well,
alternative models might be fitted. However, more often than not, no such model provides a good fit
(Liu & Maydeu-Olivares, 2014). In this situation, researchers have to differentiate
well-fitting items from poorly fitting ones; on the basis of the item fit analysis, they may then
decide to retain only the well-fitting set or to apply an alternative IRT model to the poorly fitting set. In
other words, item-level fit analysis not only serves as a complementary check to global fit
analysis but is also essential in scale development because the fit results help guide item revision
or deletion (Liu & Maydeu-Olivares, 2014). Although there is abundant research on
item-level fit evaluation for both unidimensional and multidimensional IRT models, an
effective item-level fit index for hierarchical models has been lacking, and the aim of the present study
was to fill this gap. Moreover, little information has been available on which
chi-square-based item fit index is recommended for HO-IRT models under different conditions.
Hence, another main purpose of this study was to compare the performance of chi-square-based
item fit indices for HO-IRT models. Last but not least, little research has examined how item fit
indices perform in detecting misfit that stems from a different latent trait structure; a further goal of
this study was therefore to examine the power of item fit indices to distinguish among HO-IRT,
bifactor, and MIRT models.

Across all simulation conditions, S-G² is recommended for HO-IRT models because of its smaller
FPR and adequate CDR. Similar to the findings reported in the literature, both S-X² and S-G²
perform poorly in detecting item-level model misspecification due to guessing behavior and
perform well in detecting different discrimination scales. Furthermore, both S-X² and S-G² perform
poorly in detecting item misspecification due to the multivariate structure of the latent traits.

Because S-X² and S-G² performed well in this study when the source of misfit was an
inaccurate functional form assumed for the ICC, it may be useful to extend the procedures to other
hierarchical models such as third-order IRT models (Rijmen, Jeon, von Davier, &
Rabe-Hesketh, 2014), response-time models (van der Linden, 2009), and multilevel IRT models (Fox
& Glas, 2001). The study could also be readily extended to polytomous data. In addition,
whereas Li and Rupp (2011) compared the performance of model fit indices in detecting
misfit at the global model level, the performance of such indices in detecting misfit at the
local item level should be further compared. It is also not rare for latent traits to be
nonnormally distributed, as in personality or psychopathology measures (Micceri, 1989); the
consequences of normality violations for item-level fit should be further assessed. Finally, it would be
worthwhile to compare chi-square-based indices with other fit statistics (Douglas & Cohen,
2001; Glas & Suarez-Falcon, 2003; Haberman, 2009; Haberman et al., 2013; Sinharay, 2006)
for hierarchical models.


Acknowledgments 


The authors thank the Editor in Chief Dr. Hua-Hua Chang, the Associate Editor Dr. Daniel Bolt and two 
anonymous reviewers for their helpful comments on earlier drafts of this article. 


Declaration of Conflicting Interests 


The author(s) declared no potential conflicts of interest with respect to the research, authorship, and/or pub- 
lication of this article. 


Funding 


The author(s) disclosed receipt of the following financial support for the research, authorship, and/or pub- 
lication of this article: This research was supported by the National Natural Science and Social Science 
Foundations of China (Grant 11571069) and Institute of Education Sciences (IES) (Grant R305D160010). 


Supplemental Material 


Supplemental material is available for this article online. 


References 


Baker, F. B., & Kim, S.-H. (2004). Item response theory: Parameter estimation techniques (2nd ed.,
Revised and expanded). New York, NY: Marcel Dekker.

Bartholomew, D. J., & Leung, S. O. (2002). A goodness of fit test for sparse 2^p contingency tables. British
Journal of Mathematical and Statistical Psychology, 55, 1-15.

Bock, R. D. (1972). Estimating item parameters and latent ability when responses are scored in two or 
more nominal categories. Psychometrika, 37, 29-51. 

Bock, R. D., & Haberman, S. J. (2009, July). Confidence bands for examining goodness-of-fit of estimated 
item response functions. Paper presented at Annual Meeting of the Psychometric Society, Cambridge, 
UK. 

Cai, L. (2015). Lord-Wingersky algorithm version 2.0 for hierarchical item factor models with applications 
in test scoring, scale alignment, and model fit testing. Psychometrika, 80, 535-559. 

Cai, L., Maydeu-Olivares, A., Coffman, D. L., & Thissen, D. (2006). Limited-information goodness-of-fit
testing of item response theory models for sparse 2^p tables. British Journal of Mathematical and
Statistical Psychology, 59, 173-194.

Chon, K. H., Lee, W. C., & Dunbar, S. B. (2010). A comparison of item fit statistics for mixed IRT 
models. Journal of Educational Measurement, 47, 318-338. 

de la Torre, J., & Hong, Y. (2010). Parameter estimation with small sample size: A higher-order IRT 
model approach. Applied Psychological Measurement, 34, 267-285. 

de la Torre, J., & Song, H. (2009). Simultaneous estimation of overall and domain abilities: A higher-order 
IRT model approach. Applied Psychological Measurement, 33, 620-639. 

DeMars, C. E. (2005). Type I error rates for Parscale's fit index. Educational and Psychological
Measurement, 65, 42-50.




Douglas, J., & Cohen, A. S. (2001). Nonparametric item response function estimation for assessing 
parametric model fit. Applied Psychological Measurement, 25, 234-243. 

Fienberg, S. E. (1979). The use of chi-squared statistics for categorical data problems. Journal of the Royal
Statistical Society: Series B (Methodological), 41, 54-64.

Fox, J. P., & Glas, C. A. (2001). Bayesian estimation of a multilevel IRT model using Gibbs sampling. 
Psychometrika, 66, 271-288. 

Glas, C. A. W., & Suarez-Falcon, J. C. (2003). A comparison of item-fit statistics for the three-parameter 
logistic model. Applied Psychological Measurement, 27, 87-106. 

Haberman, S. J. (2009). Use of generalized residuals to examine goodness of fit of item response models. 
ETS Research Report Series, 2009(1), 1-17. 

Haberman, S. J., Sinharay, S., & Chon, K. H. (2013). Assessing item fit for unidimensional item response
theory models using residuals from estimated item response functions. Psychometrika, 78, 417-440.

Hambleton, R. K., Swaminathan, H., & Rogers, H. J. (1991). Fundamentals of item response theory.
Newbury Park, CA: Sage.

Huang, H. Y. (2015). A multilevel higher order item response theory model for measuring latent growth in 
longitudinal data. Applied Psychological Measurement, 39, 362-372. 

Huang, H. Y. (2017). Mixture IRT model with a higher-order structure for latent traits. Educational and 
Psychological Measurement, 77, 275-304. 

Huang, H. Y., Chen, P. H., & Wang, W. C. (2012). Computerized adaptive testing using a class of high- 
order item response theory models. Applied Psychological Measurement, 36, 689-706. 

Huang, H. Y., & Wang, W. C. (2013). Higher order testlet response models for hierarchical latent traits 
and testlet-based items. Educational and Psychological Measurement, 73, 491-511. 

Huang, H. Y., Wang, W. C., Chen, P. H., & Su, C. M. (2013). Higher-order item response models for 
hierarchical latent traits. Applied Psychological Measurement, 37, 619-637. 

Huo, Y., de la Torre, J., Mun, E. Y., Kim, S. Y., Ray, A. E., Jiao, Y., & White, H. R. (2015). A hierarchical 
multi-unidimensional IRT approach for analyzing sparse, multi-group data for integrative data analysis. 
Psychometrika, 80, 834-855. 

LaHuis, D. M., Clark, P., & O’Brien, E. (2011). An examination of item response theory item fit indices 
for the graded response model. Organizational Research Methods, 14, 10-23. 

Lee, M. (2014). Application of higher-order IRT models and hierarchical IRT models to computerized 
adaptive testing (Electronic Theses and Dissertations). University of California, Los Angeles. 

Li, Y., & Rupp, A. A. (2011). Performance of the S-X² statistic for full-information bifactor models.
Educational and Psychological Measurement, 71, 986-1005.

Liang, T., & Wells, C. S. (2009). A model fit statistic for generalized partial credit model. Educational and 
Psychological Measurement, 69, 913-928. 

Liang, T., & Wells, C. S. (2015). A nonparametric approach for assessing goodness-of-fit of IRT models in 
a mixed format test. Applied Measurement in Education, 28, 115-129. 

Liang, T., Wells, C. S., & Hambleton, R. K. (2014). An assessment of the nonparametric approach for 
evaluating the fit of item response models. Journal of Educational Measurement, 51, 1-17. 

Liu, Y., & Maydeu-Olivares, A. (2013). Local dependence diagnostics in IRT modeling of binary data. 
Educational and Psychological Measurement, 73, 254-274. 

Liu, Y., & Maydeu-Olivares, A. (2014). Identifying the source of misfit in item response theory models. 
Multivariate Behavioral Research, 49, 354-371. 

Lord, F. M., & Wingersky, M. S. (1984). Comparison of IRT true-score and equipercentile observed-score
“equatings.” Applied Psychological Measurement, 8, 453-461.

Kang, T., & Chen, T. T. (2008). Performance of the generalized S-X² item fit index for polytomous IRT
models. Journal of Educational Measurement, 45, 391-406.

Maydeu-Olivares, A., & Joe, H. (2005). Limited- and full-information estimation and goodness-of-fit
testing in 2^n contingency tables: A unified framework. Journal of the American Statistical Association,
100, 1009-1020.

McKinley, R. L., & Mills, C. N. (1985). A comparison of several goodness-of-fit statistics. Applied 
Psychological Measurement, 9, 49-57. 




Micceri, T. (1989). The unicorn, the normal curve, and other improbable creatures. Psychological Bulletin, 
105, 156-166. 

Muraki, E., & Bock, R. D. (1997). PARSCALE: IRT item analysis and test scoring for rating-scale data. 
[Computer software]. Chicago, IL: Scientific Software International. 

Orlando, M., & Thissen, D. (2000). Likelihood-based item-fit indices for dichotomous item response 
theory models. Applied Psychological Measurement, 24, 50-64. 

Orlando, M., & Thissen, D. (2003). Further investigation of the performance of S-X²: An item fit index for
use with dichotomous item response theory models. Applied Psychological Measurement, 27, 289-298.

Ranger, J., & Kuhn, J. T. (2012). Assessing fit of item response models using the information matrix test. 
Journal of Educational Measurement, 49, 247-268. 

Reckase, M. (2009). Multidimensional item response theory (Vol. 150). New York, NY: Springer. 

Reiser, M. (2008). Goodness-of-fit testing using components based on marginal frequencies of 
multinomial data. British Journal of Mathematical and Statistical Psychology, 61, 331-360. 

Rijmen, F., Jeon, M., von Davier, M., & Rabe-Hesketh, S. (2014). A third-order item response theory
model for modeling the effects of domains and subdomains in large-scale educational assessment
surveys. Journal of Educational and Behavioral Statistics, 39, 235-256.

Roberts, J. S. (2008). Modified likelihood-based item fit statistics for the generalized graded unfolding 
model. Applied Psychological Measurement, 32, 407-423. 

Sinharay, S. (2005). Assessing fit of unidimensional item response theory models using a Bayesian 
approach. Journal of Educational Measurement, 42, 375-394. 

Sinharay, S. (2006). Bayesian item fit analysis for unidimensional item response theory models. British 
Journal of Mathematical and Statistical Psychology, 59, 429-449. 

Stone, C. A. (2000). Monte Carlo based null distribution for an alternative goodness-of-fit test statistic in 
IRT models. Journal of Educational Measurement, 37, 158-175. 

Stone, C. A., & Zhang, B. (2003). Assessing goodness of fit of item response theory models: A comparison 
of traditional and alternative procedures. Journal of Educational Measurement, 40, 331-352. 

Stroud, A. H. (1974). Numerical quadrature and solution of ordinary differential equations. New York, 
NY: Springer. 

Suarez-Falcon, J. C., & Glas, C. A. W. (2003). Evaluation of global testing procedures for item fit to the 
Rasch model. British Journal of Mathematical and Statistical Psychology, 56, 127-143. 

van der Linden, W. J. (2009). Conceptual issues in response-time modeling. Journal of Educational 
Measurement, 46, 247-272. 

Wang, C. (2014). Improving measurement precision of hierarchical latent traits using adaptive testing. 
Journal of Educational and Behavioral Statistics, 39, 452-477. 

Wang, C., Kohli, N., & Henn, L. (2016). A second-order longitudinal model for binary outcomes: Item 
response theory versus structural equation modeling. Structural Equation Modeling: A 
Multidisciplinary Journal, 23, 455-465. 

Wang, C., Shu, Z., Shang, Z., & Ku, G. (2015). Assessing item-level fit for the DINA model. Applied 
Psychological Measurement, 39, 525-538. 

Wells, C. S., & Bolt, D. M. (2008). Investigation of a nonparametric procedure for assessing goodness-of- 
fit in item response theory. Applied Measurement in Education, 21, 22-40. 

Yen, W. M. (1981). Using simulation results to choose a latent trait model. Applied Psychological 
Measurement, 5, 245-262. 

Zhang, B., & Stone, C. A. (2008). Evaluating item fit for multidimensional item response models. 
Educational and Psychological Measurement, 68, 181-196. 


